[LoadStoreVectorizer] Propagate alignment through contiguous chain #145733
base: main
Conversation
… improve vectorization
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers

Author: Drew Kersnar (dakersnar)

Changes

At this point in the vectorization pass, we are guaranteed to have a contiguous chain with defined offsets for each element. Using this information, we can derive and upgrade alignment for elements in the chain based on their offset from previous well-aligned elements. This enables vectorization of chains that are longer than the maximum vector length of the target. This algorithm is also robust to the head of the chain not being well-aligned; if we find a better alignment while iterating from the beginning to the end of the chain, we will use that alignment moving forward.

Full diff: https://github.com/llvm/llvm-project/pull/145733.diff

2 Files Affected:
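To illustrate the arithmetic with hypothetical offsets (commonAlignment is the existing helper from llvm/Support/Alignment.h that the patch uses; the function name below is purely illustrative):

#include "llvm/Support/Alignment.h"
#include <cassert>

// If one chain element is known to be 16-byte aligned, an element 16 bytes
// further along inherits align 16, while an element 20 bytes along can only
// be assumed 4-byte aligned (greatest common power-of-two divisor).
void alignmentPropagationExample() {
  llvm::Align BestAlignSoFar(16);
  assert(llvm::commonAlignment(BestAlignSoFar, 16) == llvm::Align(16));
  assert(llvm::commonAlignment(BestAlignSoFar, 20) == llvm::Align(4));
}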
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 89f63c3b66aad..e14a936b764e5 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -343,6 +343,9 @@ class Vectorizer {
/// Postcondition: For all i, ret[i][0].second == 0, because the first instr
/// in the chain is the leader, and an instr touches distance 0 from itself.
std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+ /// Propagates the best alignment in a chain of contiguous accesses
+ void propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const;
};
class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -716,6 +719,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
+ // We know that the accesses are contiguous. Propagate alignment
+ // information so that slices of the chain can still be vectorized.
+ propagateBestAlignmentsInChain(C);
+ LLVM_DEBUG({
+ dbgs() << "LSV: Chain after alignment propagation:\n";
+ dumpChain(C);
+ });
+
std::vector<Chain> Ret;
for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
// Find candidate chains of size not greater than the largest vector reg.
@@ -1634,3 +1645,27 @@ std::optional<APInt> Vectorizer::getConstantOffset(Value *PtrA, Value *PtrB,
.sextOrTrunc(OrigBitWidth);
return std::nullopt;
}
+
+void Vectorizer::propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const {
+ ChainElem BestAlignedElem = C[0];
+ Align BestAlignSoFar = getLoadStoreAlignment(C[0].Inst);
+
+ for (const ChainElem &E : C) {
+ Align OrigAlign = getLoadStoreAlignment(E.Inst);
+ if (OrigAlign > BestAlignSoFar) {
+ BestAlignedElem = E;
+ BestAlignSoFar = OrigAlign;
+ }
+
+ APInt OffsetFromBestAlignedElem =
+ E.OffsetFromLeader - BestAlignedElem.OffsetFromLeader;
+ assert(OffsetFromBestAlignedElem.isNonNegative());
+ // commonAlignment is equivalent to a greatest common power-of-two divisor;
+ // it returns the largest power of 2 that divides both A and B.
+ Align NewAlign = commonAlignment(
+ BestAlignSoFar, OffsetFromBestAlignedElem.getLimitedValue());
+ if (NewAlign > OrigAlign)
+ setLoadStoreAlignment(E.Inst, NewAlign);
+ }
+ return;
+}
diff --git a/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll b/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll
new file mode 100644
index 0000000000000..a1878dc051d99
--- /dev/null
+++ b/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll
@@ -0,0 +1,296 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -passes=load-store-vectorizer -S < %s | FileCheck %s
+
+; The IR has the first float3 labeled with align 16, and that 16 should
+; be propagated such that the second set of 4 values
+; can also be vectorized together.
+%struct.float3 = type { float, float, float }
+%struct.S1 = type { %struct.float3, %struct.float3, i32, i32 }
+
+define void @testStore(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStore(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S1:%.*]], ptr [[TMP0]], i64 0, i32 1, i32 1
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store float 0.000000e+00, ptr %1, align 16
+ %getElem = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 1
+ store float 0.000000e+00, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 2
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 1
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 2
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 2
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 3
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoad(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoad(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[L11:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <4 x float> [[TMP2]], i32 1
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP2]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S1:%.*]], ptr [[TMP0]], i64 0, i32 1, i32 1
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x i32> [[TMP3]], i32 0
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32 [[L55]] to float
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x i32> [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L66]] to float
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP3]], i32 2
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP3]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l1 = load float, ptr %1, align 16
+ %getElem = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 1
+ %l2 = load float, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 2
+ %l3 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1
+ %l4 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 1
+ %l5 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 2
+ %l6 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 2
+ %l7 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 3
+ %l8 = load i32, ptr %getElem13, align 4
+ ret void
+}
+
+; Also, test without the struct geps, to see if it still works with i8 geps/ptradd
+
+define void @testStorei8(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStorei8(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 16
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store float 0.000000e+00, ptr %1, align 16
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ store float 0.000000e+00, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 8
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 12
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 16
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 20
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 24
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 28
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoadi8(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoadi8(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[L11:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <4 x float> [[TMP2]], i32 1
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP2]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 16
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x i32> [[TMP3]], i32 0
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32 [[L55]] to float
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x i32> [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L66]] to float
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP3]], i32 2
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP3]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l1 = load float, ptr %1, align 16
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ %l2 = load float, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 8
+ %l3 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 12
+ %l4 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 16
+ %l5 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 20
+ %l6 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 24
+ %l7 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 28
+ %l8 = load i32, ptr %getElem13, align 4
+ ret void
+}
+
+
+; This version of the test adjusts the struct to hold two i32s at the beginning,
+; but still assumes that the first float3 is 16 aligned. If the alignment
+; propagation works correctly, it should be able to load this struct in three
+; loads: a 2x32, a 4x32, and a 4x32. Without the alignment propagation, the last
+; 4x32 will instead be a 2x32 and a 2x32
+%struct.S2 = type { i32, i32, %struct.float3, %struct.float3, i32, i32 }
+
+define void @testStore_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStore_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <2 x i32> zeroinitializer, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds [[STRUCT_S2:%.*]], ptr [[TMP0]], i64 0, i32 2
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S2]], ptr [[TMP0]], i64 0, i32 3, i32 1
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store i32 0, ptr %1, align 8
+ %getElem = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 1
+ store i32 0, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2
+ store float 0.000000e+00, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 1
+ store float 0.000000e+00, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 2
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 1
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 2
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 4
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 5
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoad_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoad_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <2 x i32>, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[L1:%.*]] = extractelement <2 x i32> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <2 x i32> [[TMP2]], i32 1
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds [[STRUCT_S2:%.*]], ptr [[TMP0]], i64 0, i32 2
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x float>, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP3]], i32 0
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP3]], i32 1
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x float> [[TMP3]], i32 2
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x float> [[TMP3]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S2]], ptr [[TMP0]], i64 0, i32 3, i32 1
+; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L77]] to float
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32 [[L88]] to float
+; CHECK-NEXT: [[L99:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
+; CHECK-NEXT: [[L010:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l = load i32, ptr %1, align 8
+ %getElem = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 1
+ %l2 = load i32, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2
+ %l3 = load float, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 1
+ %l4 = load float, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 2
+ %l5 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3
+ %l6 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 1
+ %l7 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 2
+ %l8 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 4
+ %l9 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 5
+ %l0 = load i32, ptr %getElem13, align 4
+ ret void
+}
+
+; Also, test without the struct geps, to see if it still works with i8 geps/ptradd
+
+define void @testStorei8_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStorei8_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <2 x i32> zeroinitializer, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 8
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 24
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store i32 0, ptr %1, align 8
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ store i32 0, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds i8, ptr %1, i64 8
+ store float 0.000000e+00, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds i8, ptr %1, i64 12
+ store float 0.000000e+00, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 16
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 20
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 24
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 28
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 32
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 36
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoadi8_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoadi8_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <2 x i32>, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[L1:%.*]] = extractelement <2 x i32> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <2 x i32> [[TMP2]], i32 1
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 8
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x float>, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP3]], i32 0
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP3]], i32 1
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x float> [[TMP3]], i32 2
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x float> [[TMP3]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 24
+; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L77]] to float
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32 [[L88]] to float
+; CHECK-NEXT: [[L99:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
+; CHECK-NEXT: [[L010:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l = load i32, ptr %1, align 8
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ %l2 = load i32, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds i8, ptr %1, i64 8
+ %l3 = load float, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds i8, ptr %1, i64 12
+ %l4 = load float, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 16
+ %l5 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 20
+ %l6 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 24
+ %l7 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 28
+ %l8 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 32
+ %l9 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 36
+ %l0 = load i32, ptr %getElem13, align 4
+ ret void
+}
Nice
LGTM, but please wait for review by someone with more LSV experience before landing.
Adding @michalpaszkowski to reviewers since I saw you reviewed a recently merged LSV change. If you can think of anyone else who would be a better fit to review this let me know :)
%l7 = load i32, ptr %getElem12, align 8
%getElem13 = getelementptr inbounds i8, ptr %1, i64 28
%l8 = load i32, ptr %getElem13, align 4
ret void
Can you use more representative tests with some uses? As-is, these all optimize to a memset or a no-op.
Slight pushback: my understanding is that lit tests are most useful when they are minimal reproducers of the problem the optimization is targeting. Adding uses would not really change the nature of this optimization. Tests like llvm/test/Transforms/LoadStoreVectorizer/NVPTX/vectorize_i1.ll follow this thinking.
If you think it would be better, I could combine each pair of load and store tests into individual tests, storing the result of the loads. Other LSV tests use that pattern a lot.
; The IR has the first float3 labeled with align 16, and that 16 should
; be propagated such that the second set of 4 values
; can also be vectorized together.
Is this papering over a missed middle end optimization? The middle end should have inferred the pointer argument to align 16, and then each successive access should already have a refined alignment derived from that
I think this is nice to have for a variety of edge cases, but the specific motivator for this change is based on InstCombine's handling of nested structs. You are correct that if loads/stores are unpacked from aligned loads/stores of aggregates, alignment will be propagated to the unpacked elements. However, with nested structs, InstCombine unpacks one layer at a time, losing alignment context in between passes over the worklist.
Before IC:
; RUN: opt < %s -passes=instcombine -S | FileCheck %s
%struct.S1 = type { %struct.float3, %struct.float3, i32, i32 }
%struct.float3 = type { float, float, float }
define void @_Z7init_s1P2S1S0_(ptr noundef %v, ptr noundef %vout) {
%v.addr = alloca ptr, align 8
%vout.addr = alloca ptr, align 8
store ptr %v, ptr %v.addr, align 8
store ptr %vout, ptr %vout.addr, align 8
%tmp = load ptr, ptr %vout.addr, align 8
%tmp1 = load ptr, ptr %v.addr, align 8
%1 = load %struct.S1, ptr %tmp1, align 16
store %struct.S1 %1, ptr %tmp, align 16
ret void
}
After IC:
define void @_Z7init_s1P2S1S0_(ptr noundef %v, ptr noundef %vout) {
%.unpack.unpack = load float, ptr %v, align 16
%.unpack.elt7 = getelementptr inbounds nuw i8, ptr %v, i64 4
%.unpack.unpack8 = load float, ptr %.unpack.elt7, align 4
%.unpack.elt9 = getelementptr inbounds nuw i8, ptr %v, i64 8
%.unpack.unpack10 = load float, ptr %.unpack.elt9, align 8
%.elt1 = getelementptr inbounds nuw i8, ptr %v, i64 12
%.unpack2.unpack = load float, ptr %.elt1, align 4
%.unpack2.elt12 = getelementptr inbounds nuw i8, ptr %v, i64 16
%.unpack2.unpack13 = load float, ptr %.unpack2.elt12, align 4 ; <----------- this should be align 16
%.unpack2.elt14 = getelementptr inbounds nuw i8, ptr %v, i64 20
%.unpack2.unpack15 = load float, ptr %.unpack2.elt14, align 4
%.elt3 = getelementptr inbounds nuw i8, ptr %v, i64 24
%.unpack4 = load i32, ptr %.elt3, align 8
%.elt5 = getelementptr inbounds nuw i8, ptr %v, i64 28
%.unpack6 = load i32, ptr %.elt5, align 4
store float %.unpack.unpack, ptr %vout, align 16
%vout.repack23 = getelementptr inbounds nuw i8, ptr %vout, i64 4
store float %.unpack.unpack8, ptr %vout.repack23, align 4
%vout.repack25 = getelementptr inbounds nuw i8, ptr %vout, i64 8
store float %.unpack.unpack10, ptr %vout.repack25, align 8
%vout.repack17 = getelementptr inbounds nuw i8, ptr %vout, i64 12
store float %.unpack2.unpack, ptr %vout.repack17, align 4
%vout.repack17.repack27 = getelementptr inbounds nuw i8, ptr %vout, i64 16
store float %.unpack2.unpack13, ptr %vout.repack17.repack27, align 4
%vout.repack17.repack29 = getelementptr inbounds nuw i8, ptr %vout, i64 20
store float %.unpack2.unpack15, ptr %vout.repack17.repack29, align 4
%vout.repack19 = getelementptr inbounds nuw i8, ptr %vout, i64 24
store i32 %.unpack4, ptr %vout.repack19, align 8
%vout.repack21 = getelementptr inbounds nuw i8, ptr %vout, i64 28
store i32 %.unpack6, ptr %vout.repack21, align 4
ret void
}
To visualize what's happening under the hood, InstCombine is unpacking the load in stages like this:
%1 = load %struct.S1, ptr %tmp1, align 16
->
load struct.float3 align 16
load struct.float3 align 4
load i32 align 8
load i32 align 4
->
load float align 16
load float align 4
load float align 8
load float align 4
load float align 4
load float align 4
load i32 align 8
load i32 align 4
If InstCombine is fixed to propagate the alignment properly, will this patch still be useful/needed? If not, then I would agree with Matt that we should try fixing the source of the problem.
This InstCombine case is the only one I currently know of that benefits from this specific optimization, but I would suspect there could be a different way to arrive at a pattern like this.
But let's ignore those hypotheticals, because I understand the desire to stick within the realm of known use cases. The question is, are we comfortable changing the unpackLoadToAggregate algorithm in InstCombine to recurse through nested structs?
llvm-project/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
Lines 788 to 801 in 8c32f95
for (uint64_t i = 0; i < NumElements; i++) {
  Value *Indices[2] = {
      Zero,
      ConstantInt::get(IdxType, i),
  };
  auto *Ptr = IC.Builder.CreateInBoundsGEP(AT, Addr, ArrayRef(Indices),
                                           Name + ".elt");
  auto EltAlign = commonAlignment(Align, Offset.getKnownMinValue());
  auto *L = IC.Builder.CreateAlignedLoad(AT->getElementType(), Ptr,
                                         EltAlign, Name + ".unpack");
  L->setAAMetadata(LI.getAAMetadata());
  V = IC.Builder.CreateInsertValue(V, L, i);
  Offset += EltSize;
}
I feel like implementing that would be a bit of a mess, and leads to questions like "how many layers deep should it recurse"? Unpacking one layer at a time and then adding the nested struct elements to the IC worklist to be independently operated on in a later iteration is a much cleaner solution and feels more in line with the design philosophy of InstCombine. I could be wrong though, open to feedback or challenges to my assumptions.
And just to be clear, in case my previous explanation was confusing, the reason it would have to recurse through all layers at once is that we cannot store the knowledge that "the second element of load struct.float3 align 4 is aligned to 16" on the instruction; there isn't a syntax available to express that.
Edit: We would also have to change unpackStoreToAggregate to do this too, further increasing the code complexity
This has the align 16 with opt -attributor-enable=module -O3. The default attribute passes do a worse job, it seems.
Analysis implies that we collect the information, but we're not making any actual changes to the IR.
Collecting the chains would fit that description. Collecting the chains and applying new alignment would become a transformation.
Yeah, you're totally right. What I meant was that, in this case, the alignment-upgrading transformation is a bit of an implementation hack to carry the computed alignment through to the end of the pass, at which point it is assigned to the newly created vectorized load. Theoretically we could instead add an "alignment" field to each ChainElem in the chain data structure and upgrade the alignment in memory without touching the IR until the final vectorized load/store is created (do we theoretically like this idea more?). Just wanted to point out this detail in case it changes the equation.
I appreciate your patience answering my questions and entertaining my half-baked ideas and hypothetical scenarios.
Totally, this is worth thinking through!
I don't think this needs to be expensive? I'd expect the implementation in InferAlignments would basically look like this: Walk the block and for all accesses on constant offset GEPs remember the largest implied alignment for the base (without offset) and use that to increase alignment on future accesses with the same base.
This is interesting. I can explore this idea to see how feasible it is. I think if we wanted something that could perform as well as this LSV optimization, it would need to call getUnderlyingObject and getConstantOffset for each load/store. I can look into the time complexity of that, but my hunch is it would still end up O(n^2).
The problem is that I think if we want to avoid those expensive calls and limit the analysis to "shallow" geps that offset directly off of aligned base pointers, we will miss a lot of cases. There are transformations like StraightLineStrengthReduce that build chains of geps that increase the instruction depth needed to find the underlying object.
what I meant was in this case I think the alignment upgrading transformation is a bit of an implementation hack to carry the computed alignment through to the end of the pass, at which point it will be assigned to the newly created vectorized load.
I think it may be another way to say what I said above -- we've established that the change is sensible, it just may not be the best place/time for it to be done in LSV.
So, yes, at the moment the upgraded alignment is only used by LSV, but that's exactly the reason why it may not be the best place for it. Having better (or just known) alignment info is generally helpful and we want to make it visible to other passes, independently of LSV. E.g. there may be cases where LSV would not be able to help much, but we could still improve lowering of non-vectorized loads/stores.
Having better (or just known) alignment info is generally helpful and we want to make it visible to other passes, independently of LSV.
Are there any existing passes in particular that would benefit from this upgraded alignment? If there is nothing currently existing that would take advantage of it, I lean towards moving forward with this improvement in the LSV for now and considering generalizing it in the future.
I think the compile time tradeoffs that come with implementing it in InstCombine or InferAlignments would be not worth it. For InstCombine specifically, we have a lot of practical issues with the compile time stress it puts on our compiler at NVIDIA, so we tend to avoid adding complexity to it when possible.
Are there any existing passes in particular that would benefit from this upgraded alignment?
Isel benefits from better alignment information in general.
I think the compile time tradeoffs that come with implementing it in InstCombine or InferAlignments would be not worth it. For InstCombine specifically, we have a lot of practical issues with the compile time stress it puts on our compiler at NVIDIA, so we tend to avoid adding complexity to it when possible.
We should definitely not infer better alignments in InstCombine. We specifically split the InferAlignment pass out of InstCombine to reduce the compile time spent repeatedly inferring alignment.
I think with the scheme I proposed above doing this in InferAlignment should be close to free, but of course we won't know until we try.
The problem is that I think if we want to avoid those expensive calls and limit the analysis to "shallow" geps that offset directly off of aligned base pointers, we will miss a lot of cases. There are transformations like StraightLineStrengthReduce that build chains of geps that increase the instruction depth needed to find the underlying object.
I'm not sure we'll be losing much in practice. I'd generally expect that after CSE, cases where alignment can be increased in this manner will have the form of constant offset GEPs from a common base. All of the tests you added in this PR are of this form. Did you see motivating cases where this does not hold?
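For concreteness, here is a rough sketch (not the actual InferAlignment implementation) of the walk proposed above: remember the best alignment implied for each GEP base pointer within a block and use it to raise later accesses off the same base. It is simplified to a single block and one level of constant-offset GEPs; getLoadStorePointerOperand, getLoadStoreAlignment, setLoadStoreAlignment, and commonAlignment are existing LLVM utilities, while everything else (including the function name) is hypothetical.

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Operator.h"
#include "llvm/Support/Alignment.h"

using namespace llvm;

static void raiseAlignmentsFromCommonBase(BasicBlock &BB, const DataLayout &DL) {
  // Best alignment known to hold at offset 0 of each base pointer.
  DenseMap<Value *, Align> BestBaseAlign;

  for (Instruction &I : BB) {
    Value *Ptr = getLoadStorePointerOperand(&I);
    if (!Ptr)
      continue; // not a load or store

    // Split the pointer into (Base, constant byte offset) if possible.
    Value *Base = Ptr;
    APInt Offset(DL.getIndexTypeSizeInBits(Ptr->getType()), 0);
    if (auto *GEP = dyn_cast<GEPOperator>(Ptr)) {
      if (!GEP->accumulateConstantOffset(DL, Offset))
        continue;
      Base = GEP->getPointerOperand();
    }
    uint64_t Off = Offset.abs().getLimitedValue();
    Align Known = getLoadStoreAlignment(&I);

    // Use what we already know about the base to raise this access.
    auto It = BestBaseAlign.find(Base);
    if (It != BestBaseAlign.end()) {
      Align FromBase = commonAlignment(It->second, Off);
      if (FromBase > Known) {
        setLoadStoreAlignment(&I, FromBase);
        Known = FromBase;
      }
    }

    // This access in turn implies an alignment for the base itself
    // (gcd of the access alignment and its offset, as powers of two).
    Align ImpliedForBase = commonAlignment(Known, Off);
    auto Ins = BestBaseAlign.try_emplace(Base, ImpliedForBase);
    if (!Ins.second && ImpliedForBase > Ins.first->second)
      Ins.first->second = ImpliedForBase;
  }
}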
PropagateAlignments(C);
PropagateAlignments(reverse(C));
Why does this need to go through twice? I'd expect each load and store to have started with an optimally computed alignment
test_forward_and_reverse in the test file demonstrates why going backwards could theoretically be useful. If you have a well-aligned element later in the chain, propagating the alignment backwards could improve earlier elements in the chain.
I haven't thought of a specific end-to-end test that would trigger this, but I think it can't hurt to include.
What if the alignment info at the beginning/end of the chain conflicts? Which one should be believed?
I think we should only propagate alignment info in one direction, forward, as it's the alignment of the base pointer that's the ground truth there.
What if the alignment info at the beginning/end of the chain conflict?
I think this isn't possible if the input IR is correct, right? For example, if one piece of analysis proves that a load is aligned to 4, and another proves it is aligned to 16, they are both correct, it's just that the 16 is a more precise result, and thus we should prefer it as it gives us more information.
I think we should only propagate alignment info in one direction, forward, as it's the alignment of the base pointer that's the ground truth there.
I can't really argue against that from a practicality perspective, as I haven't come up with a use case for the reverse propagation. But it would be correct, as the alignment argument on any load/store in the input IR has to be accurate, right? https://llvm.org/docs/LangRef.html#load-instruction for reference: "It is the responsibility of the code emitter to ensure that the alignment information is correct. Overestimating the alignment results in undefined behavior. Underestimating the alignment may produce less efficient code."
I'm not against reverse propagation of the alignment; I'm just not convinced that it will buy us much on top of the forward propagation. If I'm wrong, it's trivial to add it.
I also suspect that it will have corner cases. E.g. something like this:
Imagine that we have multiple nested function calls passing a semi-opaque structure pointer around. The intermediate functions only know about the structure header, and the leaf functions know about the opaque data that follows. The top-level caller guarantees that the pointer is properly aligned for the leaf function consumption, but intermediate layers do not.
Let's assume that we have two leaf functions f_aligned and f_default. f_aligned casts the pointer to the opaque blob to a type aligned by 16 and does a load.
Back-propagation would infer alignment for the access in the intermediate functions (let's assume they all get inlined, so we see all memory accesses).
The f_default function applies no explicit alignment and issues a load with no alignment set, so we can't infer anything from it. However, the caller also does not do anything extra about the alignment of the pointer it passes along, so it's not guaranteed to have high alignment.
Before inferring alignment, the IR is valid for calls to both f_aligned and f_default.
However, after we reverse-propagate alignment info, the IR will be valid only for the code path that involves f_aligned.
I'm pretty much in the same camp actually. I added it based on another review, but we really don't have a clear motivator for it. I'm fine with leaving it out until we find such a motivator.
This was my comment. Here is my understanding and how I've thought about this. A chain of loads/stores is a group of these operations where we know they all have fixed offsets relative to each other (and since we're considering vectorizing, they all have the same control dependency). The insight of this change is that the offsets can be used in combination with the alignments already on these memory operations to deduce better alignments for other elements in the chain. We're doing this by finding the operation with the greatest alignment and then upgrading the alignments of other operations based on their offset relative to this best-aligned operation.
The question is whether we should apply this deduction to all elements or restrict it to only elements which have addresses greater than the most-aligned-operation. I think it's clearest conceptually and potentially improves perf to just apply the deduction to all members of the chain. Limiting to only items with greater addresses isn't going to buy us much in terms of reducing complexity or compile time (even if these are the only cases we'd need to address the IC multi-level struct issue).
Will we propagate align 16 to load of gep32?
What will be the alignment of 'load base' ?
What if the order of loads of gep_32 and gep_16 changes? We should end up with the same final alignment regardless of it. With the code iterating over the chain linearly, it may have trouble dealing with propagation across sibling branches.
I think we should apply align 16 to all 3 loads.
I'm not quite sure if it's literally a chain from the leaf to the root, or actually a graph with all known pointers derived from the same base.
So in this specific case, this chain would be split earlier in the algorithm in splitChainByContiguity, because there are memory gaps between the loads. But for the sake of your question let's ignore that.
The chain would be a list of elements and their offset from the base chain element. It's never a graph. Loads/stores that are not able to be boiled down to this format are tossed away. So your example would look like this:
{Inst = load i32, base, Offset = 0}
{Inst = load i32, gep_16, align 16, Offset = 16}
{Inst = load i32, gep_32, align 4, Offset = 32}
What if the order of loads of gep_32 and gep_16 changes? We should end up with the same final alignment regardless of it. With the code iterating over the chain linearly, it may have trouble dealing with propagation across sibling branches.
The basic block order does not matter to the algorithm, at least not after it proves there are no data dependencies in splitChainByMayAliasInstrs. After that point, it sorts the chain by Offset order, so any permutation of the original IR would deterministically end up as the same chain.
With the code iterating over the chain linearly, it may have trouble dealing with propagation across sibling branches.
This would be true if it was represented as a graph, but it is not. It is a list of tuples, each tuple containing an instruction and offset. Each instruction has been proven to have the same underlying object and all that matters for the algorithm is the offset from the "head" of the chain, which is often pointing directly to that underlying object.
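For reference, the chain element type in the pass is roughly just an instruction plus its constant byte distance from the chain leader; the sketch below is an approximation based on the fields used in the diff (the exact SmallVector alias is assumed):

struct ChainElem {
  Instruction *Inst;       // the load or store
  APInt OffsetFromLeader;  // constant byte offset from the chain's first element
};
using Chain = SmallVector<ChainElem, 1>;  // approximate; see LoadStoreVectorizer.cpp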
I agree with Alex's framing of this solution; thank you for your explanation.
I think it's clearest conceptually and potentially improving perf to just apply the deduction to all members of the chain. Limiting to only items with greater addresses isn't going to buy us much in terms of reducing complexity or compile time
I think this is quite reasonable. Rather than "why should we propagate backwards", "why shouldn't we propagate backwards" is potentially a better question, because there is a risk of overfitting the solution to the use case if we don't answer that question.
Thank you for the explanation.
So, in this case, the combination of forward and backward propagation is equivalent to finding the largest alignment at a known offset and then applying that known alignment to the other operations in the chain. The forward pass applies the alignment to the operations that follow the max-alignment point, and the reverse pass applies it to the preceding ones.
We may be doing somewhat unnecessary work (we will be applying smaller alignment until we reach the max point and we'll eventually apply the max alignment on the reverse pass), but it should be valid.
Perhaps it would make sense to separate finding the max alignment from upgrading the alignment. Finding the max alignment could be constrained to finding a large-enough value, so we can stop the search early, as we don't need alignments larger than the size of the largest aligned load. It also makes the forward/backward distinction moot.
as we don't need alignments larger than the size of the largest aligned load
Actually, I think we would want to search for alignments up to the memory size of the chain, and at that point it's probably not worth including an early exit case, as I don't think we have enough context at this point to build an easy heuristic.
However, I agree that the algorithm you are suggesting is more intuitive/readable and saves some calls to setMaxAlignment with the same number of iterations, even without the early exit. I just updated it and I think it's improved.
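For illustration, a sketch of the separated find-then-apply shape discussed above, reusing the pass's existing ChainElem, OffsetFromLeader, and the getLoadStoreAlignment/setLoadStoreAlignment helpers. This is a sketch only, not the exact code that landed after this review.

void Vectorizer::propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const {
  // Phase 1: find the element whose explicit alignment is largest.
  const ChainElem *Best = &C[0];
  Align BestAlign = getLoadStoreAlignment(C[0].Inst);
  for (const ChainElem &E : C) {
    Align A = getLoadStoreAlignment(E.Inst);
    if (A > BestAlign) {
      Best = &E;
      BestAlign = A;
    }
  }

  // Phase 2: every element sits at a known constant distance from Best, so it
  // is aligned to at least commonAlignment(BestAlign, |distance|). This covers
  // elements both before and after Best, making a separate reverse pass moot.
  for (const ChainElem &E : C) {
    APInt Dist = E.OffsetFromLeader - Best->OffsetFromLeader;
    Align NewAlign = commonAlignment(BestAlign, Dist.abs().getLimitedValue());
    if (NewAlign > getLoadStoreAlignment(E.Inst))
      setLoadStoreAlignment(E.Inst, NewAlign);
  }
}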
I still have the feeling this is beyond the scope of what the vectorizer should be doing. Attributor / functionattrs + instcombine should be dealing with this
@arsenm Just in case my message gets buried in the other thread, I'll put it here too: I'd love to look closer at this, can you share details about your repro/commands? Thank you :)